Customer Behavioural Analytics in the Retail Sector


Team CuBA

Nadiia Honcharenko (220681, nadiia.honcharenko@st.ovgu.de)
Rutuja Shivraj Pawar (220051, rutuja.pawar@ovgu.de)
Shivani Jadhav (223856, shivani.jadhav@st.ovgu.de)
Sumit Kundu (217453, sumit.kundu@st.ovgu.de)

Under the Guidance of

M.Sc. Uli Niemann

Overview and Motivation

A customer is a key-centric factor for any business to be successful. Conventional wisdom tells us that the cost of retaining an existing customer is far less than acquiring a new one. In order that a business has a sustainable growth, the retention of its old customer base and expansion of the new customer base is very critical. This demands from a business to understand the behaviour of its customers in relation to the business. Therefore obtaining a 360° view of its customers is crucial for a business looking for a competitive edge in the market. In such a scenario, Customer Behavioural Analytics plays an important role in leveraging data analytics to find meaningful behavioural patterns in the customer-specific business data.
Consequently, this project aims to understand the consumer behaviour in the retail sector. Decoding the consumer behaviour will be based on understanding how consumers make purchase decisions and what factors influence those decisions. This project also aims to discover existence of dependencies between customers, products and shops to highlight further insights about their behaviour. These meaningful insights will further help a business to implement strategies leading to an increased revenue through customer satisfaction.

Project Objective

This project aims to address the problem of understanding behaviour of customers of an Italian retail distribution company Coop in a single Italian city. The project intends to discover different analytical insights about the purchase behaviour of the customers through answering different formulated Research Questions (RQ)

Data Source

Supermarket aggr.Customer1
The dataset to be used is the retail market data of one of the largest Italian retail distribution company called Coop for a single Italian city.
The Supermarket aggr.Customer dataset used for the analysis contains data aggregated from customer and information from shops2 (Pennacchioli et al. 2013) and pivoted to new columns. The dataset thus contains 40 features with 60,366 instances and is approximately 14.0 MB in size.

Initial Questions

Below are the RQs which were formulated at the initial stages of the project based on a primary understanding of the data but without a detailed Exploratory Data Analysis,

1. Are customers willing to travel long distances to purchase products in spite of the high average product price in a shop?
Relevance: This will help to understand whether the price is an important factor affecting the majority of customers purchase decisions.

2. Which are the products for which customers are ready to travel long distances and for which products they select the closest shop?
Relevance: This will help to understand the nature of the products in the context of proximity. It is assumed that customers will select closest shops to buy daily products like grocery but may travel long distances to buy one-time-purchase products like kitchen equipment. As Data Science is results-driven and not based solely on intuition, this question can help to verify this assumption.

3. What is the maximum likelihood of a customer to select a particular shop to purchase a particular product?
Relevance: This will help to understand that which shops in the retail chain are in demand for a particular product. This can further facilitate better stock management to meet customer demands.

4. What is the ranking of the shops in terms of attracting more customers and thus generating more revenue and what is their individual customer base?
Relevance: This will help to understand the most popular shops in the retail chain and target different shop-level marketing schemes to the appropriate customers through customer segmentation.

5. Which are the customers that are most profitable in terms of revenue generation?
Relevance: This will help to understand the customers with potential high Customer Lifetime Value and target appropriate loyalty programs to generate satisfied loyal customers as advocates for the business.

Data Wrangling

Data Wrangling consists of different steps transforming data from raw form into a clean form which is appropriate and accurate for data analysis. Below are the different steps which were carried out as a part of Data Wrangling,

Examination of Input Dataset

Visualize the input dataset

library(tidyverse)
library(DataExplorer)

# Read data from the input csv file
filepath<- "Input Dataset/Supermarket aggr.Customer.csv"
supermarket_data <- read_csv(filepath)

# Converts data to tbl class. as tbl’s are easier to examine than data frames and View data set in spreadsheet-like display
supermarket_tbl<-tbl_df(supermarket_data)
# Check the dimension of the input dataset and the variables through a plot
plot_str(supermarket_tbl)

Generate the input dataset statistics

# Data Statistics
gather(introduce(supermarket_tbl))

Generate the input dataset summary

# Data Summary, p0 = min value, p100 = max value
library(skimr)
skim_with(integer = list(hist = NULL, p25 = NULL, p50 = NULL, p75 = NULL))
skim_with(numeric = list(hist = NULL, p25 = NULL, p50 = NULL, p75 = NULL))
skim(supermarket_tbl) %>%  skimr::kable()
## Skim summary statistics  
##  n obs: 60366    
##  n variables: 40    
## 
## Variable type: integer
## 
##              variable                missing    complete      n       mean         sd       p0    p100  
## ----------------------------------  ---------  ----------  -------  ---------  ----------  ----  -------
##            customer_id                  0        60366      60366    30183.5    17426.31    1     60366 
##         products_purchased              1        60365      60366    1778.71    2185.05     1     22131 
##     products_purchased_shop_1           0        60366      60366    887.81     1438.48     0     17016 
##     products_purchased_shop_2           0        60366      60366    605.37     1382.14     0     22110 
##     products_purchased_shop_3           0        60366      60366    156.68      723.82     0     16913 
##     products_purchased_shop_4           0        60366      60366     56.65      455.1      0     17445 
##     products_purchased_shop_5           0        60366      60366     72.18      504.96     0     20891 
##             shops_used                  1        60365      60366     2.38        1.01      1       5   
##     unique_products_purchased           1        60365      60366    330.67       236       1     1465  
##  unique_products_purchased_shop_1       0        60366      60366     222.3      213.98     0     1459  
##  unique_products_purchased_shop_2       0        60366      60366    126.61      172.33     0     1161  
##  unique_products_purchased_shop_3       0        60366      60366     31.91      88.53      0     1048  
##  unique_products_purchased_shop_4       0        60366      60366     11.57      52.26      0      846  
##  unique_products_purchased_shop_5       0        60366      60366     15.77      61.93      0      841  
## 
## Variable type: numeric
## 
##         variable            missing    complete      n       mean        sd        p0        p100   
## -------------------------  ---------  ----------  -------  ---------  ---------  -------  ----------
##     amount_purchased           1        60365      60366    4235.49    5006.78    0.21     51588.66 
##  amount_purchased_shop_1       0        60366      60366    2310.43    3509.99      0      47839.38 
##  amount_purchased_shop_2       0        60366      60366    1373.29    3104.61      0      44033.96 
##  amount_purchased_shop_3       0        60366      60366    303.26     1391.14      0      31246.35 
##  amount_purchased_shop_4       0        60366      60366    110.31     880.51       0      29628.03 
##  amount_purchased_shop_5       0        60366      60366    138.13     960.49       0      36819.42 
##   avg_distance_to_shops        1        60365      60366    2030.23    1119.91    6.64     9004.16  
##         avg_price              1        60365      60366     3.67       9.13      0.21      787.57  
##     avg_price_shop_1           0        60366      60366     4.68       15.96       0       787.57  
##     avg_price_shop_2           0        60366      60366      2.3       9.97        0       787.57  
##     avg_price_shop_3           0        60366      60366     0.76        1.9        0       263.23  
##     avg_price_shop_4           0        60366      60366     0.44       4.39        0       522.61  
##     avg_price_shop_5           0        60366      60366     0.45       1.33        0       133.97  
##       avg_purchase             1        60365      60366     8.53       10.32     0.21      787.57  
##    avg_purchase_shop_1         0        60366      60366      8.1       16.54       0       787.57  
##    avg_purchase_shop_2         0        60366      60366     4.97       11.23       0       787.57  
##    avg_purchase_shop_3         0        60366      60366     1.52       3.69        0       263.23  
##    avg_purchase_shop_4         0        60366      60366     0.76        4.9        0       522.61  
##    avg_purchase_shop_5         0        60366      60366     0.81        2.6        0       133.97  
##      distance_shop_1           0        60366      60366    2496.63    1281.53    93.28    8019.92  
##      distance_shop_2           0        60366      60366    2488.24    1417.36    11.19    9004.16  
##      distance_shop_3           0        60366      60366    1924.97    1157.75    17.84    7395.25  
##      distance_shop_4           0        60366      60366    2882.8     1730.09    6.64     9273.69  
##      distance_shop_5           0        60366      60366    2020.9     1260.85    25.46    7465.81  
##   max_distance_to_shops        1        60365      60366    2942.67    1327.53    6.64      9267.7  
##   min_distance_to_shops        1        60365      60366    1396.44    1048.42    6.64     9004.16

Dataset Cleaning and Processing

Eliminate the mising values in the input dataset

# Eliminate the missing values in the dataset
supermarket_tbl_Clean1<-na.omit(supermarket_tbl)
na.action(supermarket_tbl_Clean1)
## 44617 
## 44617 
## attr(,"class")
## [1] "omit"
# Percentage of data missing (Should be 0%) 
missing_data_count = sum(!complete.cases(supermarket_tbl_Clean1))
total_data = dim(supermarket_tbl_Clean1)[1]
missing_data_percent = (missing_data_count/total_data) * 100
missing_data_percent
## [1] 0

Eliminate the duplicate rows in the input dataset

# Eliminate duplicate rows
distinct(supermarket_tbl_Clean1)

Round the decimal values in the input dataset

# Round the decimal value columns upto 4 decimal places
is.num <- sapply(supermarket_tbl_Clean1, is.numeric)
supermarket_tbl_Clean1[is.num] <- lapply(supermarket_tbl_Clean1[is.num], round, 4)

Rename the column names in the input dataset

# Rename column names

## From products_purchased to products_purchased_total
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'products_purchased'] <- 'products_purchased_total'

## From shops_used to shops_used_total
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'shops_used'] <- 'shops_used_total'

## From amount_purchased to amount_purchased_total
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'amount_purchased'] <- 'amount_purchased_total'

## From min_distance_to_shops to min_dist_to_custSel_shops
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'min_distance_to_shops'] <- 'min_dist_to_custSel_shops'

## From max_distance_to_shops to max_dist_to_custSel_shops
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'max_distance_to_shops'] <- 'max_dist_to_custSel_shops'

## From unique_products_purchased to unique_products_purchased_total_exclCommon
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'unique_products_purchased'] <- 'unique_products_purchased_total_exclCommon'

## From avg_distance_to_shops to avg_distance_to_all_shops
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_distance_to_shops'] <- 'avg_distance_to_all_shops'

## From avg_price_shop_1 to avg_product_price_shop_1
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_price_shop_1'] <- 'avg_product_price_shop_1'

## From avg_price_shop_2 to avg_product_price_shop_2
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_price_shop_2'] <- 'avg_product_price_shop_2'

## From avg_price_shop_3 to avg_product_price_shop_3
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_price_shop_3'] <- 'avg_product_price_shop_3'

## From avg_price_shop_4 to avg_product_price_shop_4
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_price_shop_4'] <- 'avg_product_price_shop_4'

## From avg_price_shop_5 to avg_product_price_shop_5
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_price_shop_5'] <- 'avg_product_price_shop_5'

## From avg_price to avg_purchased_product_price_allShops
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_price'] <- 'avg_purchased_product_price_allShops'

## From avg_purchase_shop_1 to avg_purchase_amount_shop_1
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_purchase_shop_1'] <- 'avg_purchase_amount_shop_1'

## From avg_purchase_shop_2 to avg_purchase_amount_shop_1
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_purchase_shop_2'] <- 'avg_purchase_amount_shop_2'

## From avg_purchase_shop_3 to avg_purchase_amount_shop_3
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_purchase_shop_3'] <- 'avg_purchase_amount_shop_3'

## From avg_purchase_shop_4 to avg_purchase_amount_shop_4
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_purchase_shop_4'] <- 'avg_purchase_amount_shop_4'

## From avg_purchase_shop_5 to avg_purchase_amount_shop_5
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_purchase_shop_5'] <- 'avg_purchase_amount_shop_5'

## From avg_purchase to avg_purchase_amount_allShops
names(supermarket_tbl_Clean1)[names(supermarket_tbl_Clean1) == 'avg_purchase'] <- 'avg_purchase_amount_allShops'

Reorder the columns in the input dataset

# Reorder Columns
supermarket_tbl_Clean1 <- supermarket_tbl_Clean1[c(1,10,11,12,13,14,15,3,4,2,16,17,18,19,20,5,21,22,23,24,25,6,36,37,38,39,40,9,26,27,28,29,30,7,31,32,33,34,35,8)]

Write the cleaned dataset tbl to a CSV file

# Write the cleaned data tbl to csv
clean_filepath = "~/R GitHub/Data-Science-with-R/Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
write.csv(supermarket_tbl_Clean1, file = clean_filepath, row.names = FALSE)

Exploration of Cleaned Dataset

Visualize the cleaned dataset

# Check the dimension of the cleaned dataset and the variables
plot_str(supermarket_tbl_Clean1)

Generate the cleaned dataset statistics

# Cleaned data Statistics 
gather(introduce(supermarket_tbl_Clean1))

Analyze the Continuous Variables in the cleaned dataset as a Histogram

# Analyze Continuous Variables in the cleaned dataset (Univariate Analysis)
plot_histogram(supermarket_tbl_Clean1)

Interpretation: The histograms plotted depict the distribution of each continuous variable in the dataset. These plots can be used to understand the data spread, whether the data is symmetric or skewed and graphically summarize the univariate dataset distribution.

Examine the corelated features in the cleaned dataset through Correlation Analysis

# Correlation analysis (Multivariate Analysis, On Continuous features only) to examine corelated features in the cleaned dataset
library(ggcorrplot)
corr <- round(cor(supermarket_tbl_Clean1), 1)

ggcorrplot(corr, outline.col = "white") +geom_tile(height=1.8, width=1.8) +
  scale_fill_gradient2(low="blue", mid="white", high="red") +
  theme_minimal() +
  coord_equal() +
  labs(x="",y="",fill="Correlation coefficient") +
  theme(axis.text.x=element_text(size=7, angle=90, vjust=1, hjust=1, 
                                 margin=margin(-3,0,0,0)),
        axis.text.y=element_text(size=7, margin=margin(0,-3,0,0)),
        panel.grid.major=element_blank()) 

Interpretation: The Correlation plot depicts the association between the variables in the dataset and the degree of association between them is displayed by the variation in the correlation coefficient color. The plot helps to understand the relationship between the different variables in the dataset.

Visualize the deviation from a specific probability distribution in the cleaned dataset through Quantile-Quantile plot

# Quantile-Quantile plot to visualize the deviation from a specific probability distribution in the cleaned dataset
plot_qq(supermarket_tbl_Clean1)

Interpretation: The Quantile-Quantile plot compares two probability distributions through plotting their quantiles against each other. This plot depicts if both set of quantiles are from the same distribution (points form a roughly straight line), helping to further understand the data distribution.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the process of visualizing the main characteristics in the data before the formal modeling on the data to discover data patterns and verify the initial primary assumptions made on the data. Below are the visualizations of the EDA carried out,

1) Visualize the highest revenue generating shops, shops selling the highest number of products, shops selling the highest number of unique products and the relation between them

library(ggplot2)
library(RColorBrewer)
library(scales)

file_path<- "Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv"
supermarket_data_clean <- read_csv(file_path)

Shop<- c(1,2,3,4,5)

# Revenue generation by Shops 1-5
# columns selected: amount_purchased_shop_1, 2, 3, 4, 5
slice1<-select(supermarket_data_clean, 29,30,31,32,33)

amountS1<-sum(slice1$amount_purchased_shop_1)
amountS2<-sum(slice1$amount_purchased_shop_2)
amountS3<-sum(slice1$amount_purchased_shop_3)
amountS4<-sum(slice1$amount_purchased_shop_4)
amountS5<-sum(slice1$amount_purchased_shop_5)

# create data frame for Revenue generation
Revenue_Generated<- c(amountS1,amountS2,amountS3,amountS4,amountS5)
Revenue<- data.frame(Shop, Revenue_Generated)
rownames(Revenue) <- NULL

# Products Sold by Shops 1-5
# columns selected: products_purchased_shop_1, 2, 3, 4, 5
slice2<-select(supermarket_data_clean, 11,12,13,14,15)

productsS1<-sum(slice2$products_purchased_shop_1)
productsS2<-sum(slice2$products_purchased_shop_2)
productsS3<-sum(slice2$products_purchased_shop_3)
productsS4<-sum(slice2$products_purchased_shop_4)
productsS5<-sum(slice2$products_purchased_shop_5)

# create data frame for Products Sold
Products_Purchased<- c(productsS1,productsS2,productsS3,productsS4,productsS5)
ProductsSold<- data.frame(Shop, Products_Purchased)
rownames(ProductsSold) <- NULL

# Unique products Sold by Shops 1-5
# columns selected: unique_products_purchased_shop_1,2,3,4,5
slice3<-select(supermarket_data_clean, 17,18,19,20,21)

uproductsS1<-sum(slice3$unique_products_purchased_shop_1)
uproductsS2<-sum(slice3$unique_products_purchased_shop_2)
uproductsS3<-sum(slice3$unique_products_purchased_shop_3)
uproductsS4<-sum(slice3$unique_products_purchased_shop_4)
uproductsS5<-sum(slice3$unique_products_purchased_shop_5)

# create data frame for Unique products Sold
UProducts_Purchased<- c(uproductsS1,uproductsS2,uproductsS3,uproductsS4,uproductsS5)
UProductsSold<- data.frame(Shop, UProducts_Purchased)
rownames(UProductsSold) <- NULL

# Plot a Bar graph to depict the above calculated data
Legends <-c(rep("Revenue Generated", 5), rep("Products Sold", 5), rep("Unique Products Sold", 5))
values <-c(Revenue_Generated, Products_Purchased, UProducts_Purchased)
mydata <-data.frame(Shop, values)

p <-ggplot(mydata, aes(Shop, values))
p +geom_bar(stat = "identity", aes(fill = Legends), position = "dodge") +
  xlab("Shop") + ylab("Total") +
  ggtitle("Relation between Revenue and Products Sold") +
  theme_bw() + scale_y_continuous(labels = scales::comma)

Analysis: As visualized, the shops ordered based on the their highest Revenue Generated are Shop 1, 2, 3, 5, 4. The shops ordered based on the their highest amount of Products Sold are Shop 1, 2, 3, 5, 4. The shops ordered based on the their highest amount of Unique Products Sold are Shop 1, 2, 3, 4 & 5 (are on the same level). The relation between these parameters as visualized based on the shop order can be determined as the shop generating the highest revenue has the highest amount of products sold (unique included). So in the dataset, the relation between the revenue generated and the products sold is directly proportional to each other. The ordering of the shops is mostly stable here at Shop 1, 2, 3, 5, 4.

2) Visualize the approximate customer base count for the different shops

# Approximate Customer Base for Shops 1-5

C1<-slice2$products_purchased_shop_1
custS1<-length(which(C1 !=0))

C2<-slice2$products_purchased_shop_2
custS2<-length(which(C2 !=0))

C3<-slice2$products_purchased_shop_3
custS3<-length(which(C3 !=0))

C4<-slice2$products_purchased_shop_4
custS4<-length(which(C4 !=0))

C5<-slice2$products_purchased_shop_5
custS5<-length(which(C5 !=0))

# create data frame for Approximate Customer Base
Customers<- c(custS1,custS2,custS3,custS4,custS5)
TotalCustomers<- data.frame(Shop, Customers)
rownames(TotalCustomers) <- NULL

# Plot a Bar graph to depict the approximate Customer Base
values <-c(Customers)
mydata <-data.frame(Shop, values)

p <-ggplot(mydata, aes(Shop, values))
p +geom_bar(stat = "identity", fill = "gray" , position = "dodge", color = "black") +
  xlab("Shop") + ylab("Total") +
  ggtitle("Customer Base") + 
  theme_bw()

Analysis: As visualized, the shops ordered based on the their highest approximate customer base are Shop 1, 2, 3, 5, 4. So the highest approximate customer base for a shop determines its popularity in terms of a customer’s product purchase from the shop. The ordering of the shops is here as Shop 1, 2, 3, 5, 4.

3) Visualize the relationship between average prices and distances to the shop

library(modelr)
library(gridExtra)

cleared_supermarket_data<-read_csv(file_path)
cleared_supermarked_tbl <- tbl_df(cleared_supermarket_data)

shop_ordered_slice <- select(cleared_supermarked_tbl, 3,23,4,24,5,25,6,26,7,27) %>% 
  bind_cols(cleared_supermarked_tbl[,8:10], cleared_supermarked_tbl[,28])

# Splitting the data
get_slice_for_shop <- function(col1, col2){
  shop_slice <- shop_ordered_slice[,col1:col2]
  colnames(shop_slice) <- c("distance","price")
  return(shop_slice)
}

shop_1_data <- get_slice_for_shop(1,2)
shop_2_data <- get_slice_for_shop(3,4)
shop_3_data <- get_slice_for_shop(5,6)
shop_4_data <- get_slice_for_shop(7,8)
shop_5_data <- get_slice_for_shop(9,10)
shop_avg_data <- get_slice_for_shop(13,14)
shop_agg_min_data <- get_slice_for_shop(11,14)
shop_agg_max_data <- get_slice_for_shop(12,14)

# Combine data to the one mutated table to show all shops at the one graph
joined_shops_data <- mutate(shop_1_data, Shop="1") %>%
  union_all(mutate(shop_2_data, Shop="2")) %>%
  union_all(mutate(shop_3_data, Shop="3")) %>%
  union_all(mutate(shop_4_data, Shop="4")) %>%
  union_all(mutate(shop_5_data, Shop="5")) 

# Create base for plots
get_base_for_plot <- function(dataset, caption){
  plot_base <- ggplot(data = dataset, mapping = aes(x = distance, y = price)) + ggtitle(caption)
  return(plot_base)
}

# Colours list
colours_shema <- c("Red", "Green", "Yellow", "Pink", "Blue", "Purple", "steelblue1", "tomato1")

#create scatter plot
add_geom_point <- function(colorNum){
  geom_p <- geom_point(colour=colours_shema[colorNum], alpha=0.3)
  return(geom_p)
}

#draw scatter plot with plot base
draw_cov_point_plot <- function(dataset, colorNum, caption){
  cov_geom_plot <- get_base_for_plot(dataset, caption) + add_geom_point(colorNum) +       
  scale_y_continuous(trans="log2")+ 
  geom_smooth(stat = 'smooth', color = 'Black', method = 'gam', formula = y ~ s(x, bs = "cs")) 
  return(cov_geom_plot)
} 

p1_1 <- draw_cov_point_plot(shop_1_data, 1, "Shop 1")
p2_1 <- draw_cov_point_plot(shop_2_data, 2, "Shop 2")
p3_1 <- draw_cov_point_plot(shop_3_data, 3, "Shop 3")
p4_1 <- draw_cov_point_plot(shop_4_data, 4, "Shop 4")
p5_1 <- draw_cov_point_plot(shop_5_data, 5, "Shop 5")
pavg_1 <- draw_cov_point_plot(shop_avg_data, 6, "Average price with average distance")
pmin_1 <- draw_cov_point_plot(shop_agg_min_data, 7, "Average price with min distance")
pmax_1 <- draw_cov_point_plot(shop_agg_max_data, 8, "Average price with max distance")

pall_1 <- get_base_for_plot(joined_shops_data, "All shops") + geom_point(mapping = aes(colour = Shop), alpha=0.3) 

comb_cov_shops <- grid.arrange(p1_1, p2_1, p3_1, p4_1, p5_1, 
                               nrow=2, ncol=3, 
                               top="Covariation between distances and average prices")

comb_cov_aggrs <- grid.arrange(pmin_1, pmax_1,
                               nrow=2,
                               top= "Covariation between min/max distances and average prices")

comb_cov_avg <- grid.arrange(pall_1, pavg_1,
                               nrow=2,
                               top= "Covariation between distances and average prices (aggregated)")

Analysis: There are strong dependencies between long average distance and the average price in a shop. Also the average price is close enough to zero, therefore it makes a sense to check for zero values for price in the current dataset and its influence.

4) Visualize data gap for average price in each shop

prices <- shop_ordered_slice[,seq(2, 10 ,2)]
names(prices) <- c("Shop 1", "Shop 2", "Shop 3", "Shop 4", "Shop 5")
dataset_with_na <- data.frame(sapply(prices, function(x) { 
    na_if(x,0)
            } ))
pavg_2 <- plot_missing(dataset_with_na) + ggtitle("Gaps in shop prices")

Analysis: As visualized, there is a data gap for the value of average price in each shop. The value for average price in a shop for a customer is only filled in the dataset if the particular customer prefers the shop, else it is left as zero. This can be considered as a data gap but at the same time it is an information which is never utilized during analysis, meaning that the customer does not choose that particular shop in the first place. But this data gap does not affect on the tendency of the relationship between price and distance.

5) Visualize the distribution density of the average price across shops

# Normilizing of average price for common average data
dt_avg <- transform(shop_avg_data, price = pnorm(price))

pavg_3 <- ggplot(data = dt_avg, mapping = aes(x = price, y = ..density..)) +
  geom_freqpoly(colour=colours_shema[6], binwidth = 0.1) + ggtitle("Average price distribution")

# Normilizing of average price for shops
dt_all <- transform(joined_shops_data, price = pnorm(price))
pall_2 <- ggplot(data = dt_all, mapping = aes(x = price, y = ..density..)) + 
  geom_freqpoly(mapping = aes(colour = Shop), binwidth = 0.1) + ggtitle("Common average price distribution")

comb_vis_distibution <- grid.arrange(pavg_3, pall_2,
                                     nrow=2, ncol=1, 
                                     top="Distribution of average price across shops")

Analysis: As visualized, according to the data the average price is varyingly distributed in each shop.

6) Visualize patterns for the average price

joined_shops_without_null <- filter(joined_shops_data, price != 0)
mod <- lm(log(price) ~ log(distance), data = joined_shops_without_null)

joined_shops_data2 <- joined_shops_without_null %>% 
  add_residuals(mod) %>% 
  mutate(resid = exp(resid))


pall_3 <- ggplot(data = joined_shops_data2) + 
  geom_point(mapping = aes(x = price, y = resid), colour=colours_shema[6], alpha=0.3) + ggtitle("Average price pattern")

pall_3

pall_4 <- ggplot(data = joined_shops_data2) + 
  geom_boxplot(mapping = aes(x = Shop, y = resid), color=colours_shema[6]) + ggtitle("Average price pattern")
pall_4

Analysis: As visualized, the residuals gave us a view of the average price after removing the distance effect. Once the strong relationship between distance and price has been removed, relationship to other external factors become noticable.

7) Visualize the customer base for each shop based on different customer segments

## Ratio of products purchased from each shop (product purchased by a customer at a shop over total produccts purchased by that customer)
prod_purch_ratio_shop_1 <- with(supermarket_data_clean, products_purchased_shop_1/products_purchased_total)
prod_purch_ratio_shop_2 <- with(supermarket_data_clean, products_purchased_shop_2/products_purchased_total)
prod_purch_ratio_shop_3 <- with(supermarket_data_clean, products_purchased_shop_3/products_purchased_total)
prod_purch_ratio_shop_4 <- with(supermarket_data_clean, products_purchased_shop_4/products_purchased_total)
prod_purch_ratio_shop_5 <- with(supermarket_data_clean, products_purchased_shop_5/products_purchased_total)

## Ratio of amount spent at each shop (amount spent by a customer at a shop over total amount spent by that customer)
amt_purch_ratio_shop_1 <- with(supermarket_data_clean, amount_purchased_shop_1/amount_purchased_total)
amt_purch_ratio_shop_2 <- with(supermarket_data_clean, amount_purchased_shop_2/amount_purchased_total)
amt_purch_ratio_shop_3 <- with(supermarket_data_clean, amount_purchased_shop_3/amount_purchased_total)
amt_purch_ratio_shop_4 <- with(supermarket_data_clean, amount_purchased_shop_4/amount_purchased_total)
amt_purch_ratio_shop_5 <- with(supermarket_data_clean, amount_purchased_shop_5/amount_purchased_total)

## Loyality of customer to a shop (Ratio of products purchased from each shop times Ratio of amount spent at each shop)
loyality_shop_1 <- prod_purch_ratio_shop_1 * amt_purch_ratio_shop_1
loyality_shop_2 <- prod_purch_ratio_shop_2 * amt_purch_ratio_shop_2
loyality_shop_3 <- prod_purch_ratio_shop_3 * amt_purch_ratio_shop_3
loyality_shop_4 <- prod_purch_ratio_shop_4 * amt_purch_ratio_shop_4
loyality_shop_5 <- prod_purch_ratio_shop_5 * amt_purch_ratio_shop_5


## Data frames based on calculated values
prod_purch_ratio_shop <- data.frame(supermarket_data_clean$customer_id, prod_purch_ratio_shop_1, prod_purch_ratio_shop_2, prod_purch_ratio_shop_3, prod_purch_ratio_shop_4, prod_purch_ratio_shop_5)
amt_purch_ratio_shop <- data.frame(supermarket_data_clean$customer_id, amt_purch_ratio_shop_1, amt_purch_ratio_shop_2, amt_purch_ratio_shop_3, amt_purch_ratio_shop_4, amt_purch_ratio_shop_5)
loyality <- data.frame(supermarket_data_clean$customer_id, loyality_shop_1, loyality_shop_2, loyality_shop_3, loyality_shop_4, loyality_shop_5)

## Data frames as per shop nos
shop_1 <- data.frame(supermarket_data_clean$customer_id, prod_purch_ratio_shop_1, amt_purch_ratio_shop_1, loyality_shop_1)
shop_2 <- data.frame(supermarket_data_clean$customer_id, prod_purch_ratio_shop_2, amt_purch_ratio_shop_2, loyality_shop_2)
shop_3 <- data.frame(supermarket_data_clean$customer_id, prod_purch_ratio_shop_3, amt_purch_ratio_shop_3, loyality_shop_3)
shop_4 <- data.frame(supermarket_data_clean$customer_id, prod_purch_ratio_shop_4, amt_purch_ratio_shop_4, loyality_shop_4)
shop_5 <- data.frame(supermarket_data_clean$customer_id, prod_purch_ratio_shop_5, amt_purch_ratio_shop_5, loyality_shop_5)


## Customer segments for Shop 1 based on Loyality Score
high_end_shop_1 <- shop_1[which(shop_1$loyality_shop_1 > 0.9), ]
mid_range_shop_1 <- shop_1[which(shop_1$loyality_shop_1 > 0.50 & shop_1$loyality_shop_1 <= 0.9), ]
low_end_shop_1 <- shop_1[which(shop_1$loyality_shop_1 > 0 & shop_1$loyality_shop_1 <= 0.5), ]

## Customer segments for Shop 2 based on Loyality Score
high_end_shop_2 <- shop_2[which(shop_2$loyality_shop_2 > 0.9), ]
mid_range_shop_2 <- shop_2[which(shop_2$loyality_shop_2 > 0.50 & shop_2$loyality_shop_2 <= 0.9), ]
low_end_shop_2 <- shop_2[which(shop_2$loyality_shop_2 > 0 & shop_2$loyality_shop_2 <= 0.5), ]

## Customer segments for Shop 3 based on Loyality Score
high_end_shop_3 <- shop_3[which(shop_3$loyality_shop_3 > 0.9), ]
mid_range_shop_3 <- shop_3[which(shop_3$loyality_shop_3 > 0.50 & shop_3$loyality_shop_3 <= 0.9), ]
low_end_shop_3 <- shop_3[which(shop_3$loyality_shop_3 > 0 & shop_3$loyality_shop_3 <= 0.5), ]

## Customer segments for Shop 4 based on Loyality Score
high_end_shop_4 <- shop_4[which(shop_4$loyality_shop_4 > 0.9), ]
mid_range_shop_4 <- shop_4[which(shop_4$loyality_shop_4 > 0.50 & shop_4$loyality_shop_4 <= 0.9), ]
low_end_shop_4 <- shop_4[which(shop_4$loyality_shop_4 > 0 & shop_4$loyality_shop_4 <= 0.5), ]

## Customer segments for Shop 5 based on Loyality Score
high_end_shop_5 <- shop_5[which(shop_5$loyality_shop_5 > 0.9), ]
mid_range_shop_5 <- shop_5[which(shop_5$loyality_shop_5 > 0.50 & shop_5$loyality_shop_5 <= 0.9), ]
low_end_shop_5 <- shop_5[which(shop_5$loyality_shop_5 > 0 & shop_5$loyality_shop_5 <= 0.5), ]


## Number of customers in each segment for Shop 1
count_high_end_shop_1 <- nrow(high_end_shop_1)
count_mid_range_shop_1 <- nrow(mid_range_shop_1)
count_low_end_shop_1 <- nrow(low_end_shop_1)

## Number of customers in each segment for Shop 2
count_high_end_shop_2 <- nrow(high_end_shop_2)
count_mid_range_shop_2 <- nrow(mid_range_shop_2)
count_low_end_shop_2 <- nrow(low_end_shop_2)

## Number of customers in each segment for Shop 3
count_high_end_shop_3 <- nrow(high_end_shop_3)
count_mid_range_shop_3 <- nrow(mid_range_shop_3)
count_low_end_shop_3 <- nrow(low_end_shop_3)

## Number of customers in each segment for Shop 4
count_high_end_shop_4 <- nrow(high_end_shop_4)
count_mid_range_shop_4 <- nrow(mid_range_shop_4)
count_low_end_shop_4 <- nrow(low_end_shop_4)

## Number of customers in each segment for Shop 5
count_high_end_shop_5 <- nrow(high_end_shop_5)
count_mid_range_shop_5 <- nrow(mid_range_shop_5)
count_low_end_shop_5 <- nrow(low_end_shop_5)

## Setting up values
shops <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5)

customer_segments <- c("high", "med", "low", "high", "med", "low", "high", "med", "low", "high", "med", "low", "high", "med", "low")

count_shop_1 <- c(count_high_end_shop_1, count_mid_range_shop_1, count_low_end_shop_1)
count_shop_2 <- c(count_high_end_shop_2, count_mid_range_shop_2, count_low_end_shop_2)
count_shop_3 <- c(count_high_end_shop_3, count_mid_range_shop_3, count_low_end_shop_3)
count_shop_4 <- c(count_high_end_shop_4, count_mid_range_shop_4, count_low_end_shop_4)
count_shop_5 <- c(count_high_end_shop_5, count_mid_range_shop_5, count_low_end_shop_5)

counts <- c(count_shop_1, count_shop_2, count_shop_3, count_shop_4, count_shop_5)


## Count of customers based on the three customer segments (high, mid, low)
shop_customer_segment_count <-data.frame(shops, customer_segments, counts)

## Change datatype of columns to numerical
shop_customer_segment_count <- shop_customer_segment_count %>% mutate_if(is.integer, as.numeric) %>% mutate_if(is.factor, as.numeric)

## Plot the count of customers for each shop based on customer segments (high - 1, mid - 3, low - 2)
qplot(shops, counts, data = shop_customer_segment_count, size=counts, color = customer_segments) + 
    scale_size_continuous(range = c(3, 8))

## Plot the count of customers for each segment
qplot(customer_segments, counts, data = shop_customer_segment_count, size=counts, color = counts) + 
    scale_size_continuous(range = c(3, 8))

Analysis: As visualized, the customers can be categorized on the basis of three different customer segements i.e. high, mid and low in terms of their loyality towards the shops and promotional offers can be customised for each segment.

Final Research Questions

Exploratory Data Analysis provided a feasibility check on the Initial RQs formulated. The EDA phase helped to get a better understanding of the data in relation with the project objective. Hence this lead to the modification, removal or addition of new RQ. Below is the final set of RQ formulated which will be answered through this project,

1. Are customers willing to travel long distances to purchase products in spite of the high average product price in a shop?
Relevance: This will help to understand whether the price is an important factor affecting the majority of customers purchase decisions.

2. What are the factors that contribute towards the long distance travel of the customer to purchase products in spite of the high average product price in a shop?
Relevance: This will help to understand the important factors that contribute towards the customers willing to travel long-distances to purchase products in spite of the high average product price in a shop, in turn better understanding the purchase behaviour of the customers

3. What is the maximum likelihood of a customer to select a particular shop?
Relevance: This will help to understand that which shops in the retail chain are in demand and predict their estimated customer base. This can further facilitate better stock management to meet customer demands.

4. What are the different customer segments based on their purchase behaviour?
Relevance: This will help to understand the groups of similar customer based on their purchase behviour and target different shop-level marketing schemes to the appropriate customers.

5. Which are the Top 100 customers that are most profitable in terms of revenue generation?
Relevance: This will help to understand the customers with potential high Customer Lifetime Value and target appropriate loyalty programs to generate satisfied loyal customers as advocates for the business.

Final Analysis

1. Are customers willing to travel long distances to purchase products in spite of the high average product price in a shop?

Algorithms selected:

Reason for Algorithm Selections:

Analysis

Observations: Thus, the research question is answered by

Business Utilization: These insights obtained can be further utilized by the business to

2. What are the factors that contribute towards the long distance travel of the customer to purchase products in spite of the high average product price in a shop?

Algorithms selected:

Reason for Algorithm Selections:

Analysis

Observations: Thus, the research question is answered by

Business Utilization: These insights obtained can be further utilized by the business to

3. What is the maximum likelihood of a customer to select a particular shop?

Algorithms selected:

Reason for Algorithm Selections:

Analysis

Observations: Thus, the research question is answered by

Business Utilization: These insights obtained can be further utilized by the business to

4. What are the different customer segments based on their purchase behaviour?

Algorithms selected: K-means, Pricipal Component Analysis (PCA)

Reason for Algorithm Selections: K-means clustering is a very simple and fast algorithm. It is the popular method used for customer segmentation and specially for numerical data. K-means also has computational advantages in terms of it scaling well with large datasets. Comparing with hierarchical and model based clustering methods, they were not able to function on the large input dataset being used due to their requirements to calculate a full distance matrix exhibiting limited scalability and big memory requirements. Considering these facts and given that the input dataset is large and mainly contains numerical data, k-means was an ideal choice for customer segmentation.

PCA is a dimensionality reduction algorithm which visualizes the essence of the dataset through data decomposition and transformation into principal components (PC) maximizing the linear variance of the data (More variance indicates more understanding of the data). PCA is used here for this research question as a valuable cross-check to k-means cluster determination.

Analysis

1) Estimating the optimal number of clusters

K-means requires to specify the number of clusters prior to the algorithm start. Determining the optimal number of clusters is crucial to output better results.

library(cluster)
library(factoextra)
library("metricsgraphics")

# Read file contents
supermarket_data_clean <- read.csv("Input Dataset/Cleaned Dataset/Supermarket_DataCleaned.csv")

# Prepare data frames for clustering
# Select only the rows customer_id, amount_purchased_shop_1, 2, 3, 4, 5
cluster.slice.temp <- supermarket_data_clean[,c(1,29,30,31,32,33)]
# Remover customer_id from the clustering data frame 
cluster.slice.data <- supermarket_data_clean[,c(29,30,31,32,33)]

# Scale the data and Determine the ideal number of clusters
cluster.slice.scale <- scale(cluster.slice.data)

wssplot <- function(data, nc=15, seed=1234){
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")}

wssplot(cluster.slice.scale)

The above plot depicts a sharp decrease from values 1 to 4 for number of clusters with a slight decrease from 4 to 5 which estimates a 4-cluster or 5-cluster solution.

2) Perform K-means clustering with number of clusters as 4 and 5

# Perform k-means on cluster values as 4 and 5

set.seed(123) # fix the random starting clusters
kclust4 <- kmeans(cluster.slice.data, 4, nstart = 25)

set.seed(123) # fix the random starting clusters
kclust5 <- kmeans(cluster.slice.data, 5, nstart = 25)

3) Perform PCA to visualize the clusters

PCA with 5 cluster K-means

cluster.pc5 <- prcomp(cluster.slice.data, center = FALSE, scale. = FALSE)$x %>% as.data.frame()
cluster.pc5$kmeans.cluster <- kclust5$cluster

mjs_plot(cluster.pc5, x=PC1, y=PC2) %>%
  mjs_point(color_accessor=kmeans.cluster) %>%
  mjs_labs(x="Principal Component 1", y="Principal Component 2")

PCA with 4 cluster K-means

cluster.pc4 <- prcomp(cluster.slice.data, center = FALSE, scale. = FALSE)$x %>% as.data.frame()
cluster.pc4$kmeans.cluster <- kclust4$cluster

mjs_plot(cluster.pc4, x=PC1, y=PC2) %>%
  mjs_point(color_accessor=kmeans.cluster) %>%
  mjs_labs(x="Principal Component 1", y="Principal Component 2")

Comparing the above two plots determines that a 5 cluster solution will be an ideal estimate for K-means clustering.

4) Visualize the different separable clusters in the data

fviz_cluster(kclust5, data = cluster.slice.data, geom = "point",
             stand = FALSE, ellipse.type = "norm") + 
  ggtitle(label='Customer Clusters')

5) Cluster Analysis

Determine the different customers belonging to each cluster

## retrieve customer ID's in each cluster
head(gather(data.frame(cluster.slice.temp[kclust5$cluster == 1,])))
## retrieve customer ID's in each cluster
head(gather(data.frame(cluster.slice.temp[kclust5$cluster == 2,])))
head(gather(data.frame(cluster.slice.temp[kclust5$cluster == 3,])))
head(gather(data.frame(cluster.slice.temp[kclust5$cluster == 4,])))
head(gather(data.frame(cluster.slice.temp[kclust5$cluster == 5,])))

6) Customer Segmentation

#Customer segmentation through aggeration of results by mean
cluster.slice.kmeans.aggregate <- aggregate(cluster.slice.data, by = list(kclust5$cluster), mean)

cluster<-c(cluster.slice.kmeans.aggregate$Group.1)
shop1<-c(cluster.slice.kmeans.aggregate$amount_purchased_shop_1)
shop2<-c(cluster.slice.kmeans.aggregate$amount_purchased_shop_2)
shop3<-c(cluster.slice.kmeans.aggregate$amount_purchased_shop_3)
shop4<-c(cluster.slice.kmeans.aggregate$amount_purchased_shop_4)
shop5<-c(cluster.slice.kmeans.aggregate$amount_purchased_shop_5)

# Plot a Bar graph
Legends <-c(rep("Customers Shop 1", 5), rep("Customers Shop 2", 5), rep("Customers Shop 3", 5), rep("Customers Shop 4", 5), rep("Customers Shop 5", 5))
values <-c(shop1,shop2,shop3,shop4,shop5)
mydata <-data.frame(cluster, values)

p <-ggplot(mydata, aes(cluster, values))
p +geom_bar(stat = "identity", aes(fill = Legends)) +
  xlab("Cluster") + ylab("Total") +
  ggtitle("Customer Segmentation") +
  theme_bw() + scale_y_continuous(labels = scales::comma)

Observations: Clustering the data based on the purchase behaviour of customers i.e. from the shops they shop the most, revealed 5 separable clusters to analyze. Cluster analysis, helped to identify the customers in each cluster based on their customer IDs. This is useful to understand the different customers that build the customer base in each cluster. Further, Customer Segmentation facilitated to identify the customers of different shops in each segment (cluster). This further helped to divide the cluster and attach meaning to it. Thus, the research question is answered by identifying five customer segments based on their purchase behaviour and further partitioning these segments by determining the specific customers belonging to the five different shops.

Usefulness: Detection of clusters can help the business to develop a specific strategy for each cluster base. Clustering can also be used to understand the purchase behaviour of customers by keeping a track of customers over months and detecting the number of customers moving from one cluster to other. This helps the business to better organize strategies to increase revenue at different shops. Customer Segmentation insights obtained can be further utilized by the business to better focus their marketing efforts on the right customers. Eg. Discounts and offers related to a particular shop can be sent to only the customers who usually purchase at the particular shop without bothering the customers of other shops. Thus, targeting the right customers for the right deals can help to cut-down the marketing costs, generate more revenue and increase customer satisfaction.

5. Which are the Top 100 customers that are most profitable in terms of revenue generation?

Algorithms selected:

Reason for Algorithm Selections:

Analysis

Observations: Thus, the research question is answered by

Business Utilization: These insights obtained can be further utilized by the business to

References

Pennacchioli, Diego, Michele Coscia, Salvatore Rinzivillo, Dino Pedreschi, and Fosca Giannotti. 2013. “Explaining the Product Range Effect in Purchase Data.” In Big Data, 2013 Ieee International Conference on, 648–56. IEEE.